Hierarchical clustering of large text datasets using Locality-Sensitive Hashing

Authors

  • Vasilii Korelin
  • Ivan Blekanov
Abstract

In this paper, we present a hierarchical clustering algorithm for large text datasets based on Locality-Sensitive Hashing (LSH). The main idea of LSH is to "hash" items several times in such a way that similar items are more likely to be hashed to the same bucket than dissimilar ones. The main drawback of conventional hierarchical algorithms is their high time complexity (e.g., the single linkage method has a time complexity of $O(n^2)$ for $n$ items). The proposed algorithm reduces the time complexity to $O(nB)$. Here, $B$ represents the maximum number of items that fall into a single bucket; $B$ is a small constant compared to $n$ when the number of buckets is large. The clustering results of the LSH-based hierarchical algorithm are similar to those of the classical single linkage method. Its main advantage is a significant increase in speed on large datasets in comparison with classical algorithms.
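The bucketing idea described in the abstract can be illustrated with a short Python sketch. It is not the authors' algorithm: it assumes word-shingle sets, MinHash signatures split into bands as the LSH scheme, and Jaccard distance, none of which the abstract specifies. All function names and parameters (shingles, minhash_signature, lsh_buckets, lsh_single_linkage, bands, rows, threshold) are hypothetical. The sketch produces one level of a single-linkage hierarchy by comparing only items that share a bucket.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word k-shingles of a text (assumed text representation)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes):
    """MinHash signature: for each seeded hash, keep the minimum value over the set."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in items
        )
        for seed in range(num_hashes)
    ]


def lsh_buckets(signatures, bands, rows):
    """Hash every item `bands` times; each band of `rows` values is one bucket key."""
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
    return buckets


def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)


def lsh_single_linkage(texts, threshold, bands=8, rows=2, k=3):
    """One dendrogram level: link items within `threshold`, checking only
    pairs that collide in at least one LSH bucket."""
    sets = [shingles(t, k) for t in texts]
    sigs = [minhash_signature(s, bands * rows) for s in sets]
    parent = list(range(len(texts)))  # union-find forest over item indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in lsh_buckets(sigs, bands, rows).values():
        for i, j in combinations(members, 2):
            ri, rj = find(i), find(j)
            if ri != rj and jaccard_distance(sets[i], sets[j]) <= threshold:
                parent[rj] = ri

    clusters = defaultdict(list)
    for i in range(len(texts)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())
```

Calling lsh_single_linkage(texts, threshold=0.5) returns lists of item indices, one list per cluster. Because pairs are compared only inside buckets, the work depends on bucket sizes rather than on all pairs of items. Further levels of a hierarchy could be obtained by repeating the call with a larger threshold and coarser hashing (fewer rows per band), which is again an assumption of this sketch and not a detail taken from the paper.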


Similar resources

Efficient Clustering of Metagenomic Sequences using Locality Sensitive Hashing

The new generation of genomic technologies has allowed researchers to determine the collective DNA of organisms (e.g., microbes) co-existing as communities across the ecosystem (e.g., within the human host). There is a need for computational approaches to analyze and annotate the large volumes of available sequence data from such microbial communities (metagenomes). In this paper, we devel...


Survey of Hashing Techniques for Compact Bit Representations of Images

Binary encoding schemes that preserve similarity in large collections of images are required for faster retrieval and effective storage. There have been many recent hashing techniques that produce semantic binary representations. This paper presents a survey of such hashing techniques that allow faster nearest neighbor search in Hamming space. Specifically, approaches that use locality-sensiti...


Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data while keeping response times low. Thus, scalability is imperative for similarity search in Web-scale applications, but most existing methods a...


Learning of Invariant Object Recognition in a Hierarchical Network

In this paper we propose an object recognition system implementing three basic principles: forming temporal groups of features, learning in a hierarchical structure, and using feedback for predicting future input. It gives very good results on publicly available datasets. A precondition for successful learning is that training images are presented to the system in an appropriate order such that i...


Scalable Protein Sequence Similarity Search using Locality-Sensitive Hashing and MapReduce

Metagenomics is the study of environments through genetic sampling of their microbiota. Metagenomic studies produce large datasets that are estimated to grow at a faster rate than the available computational capacity. A key step in the study of metagenome data is sequence similarity searching, which is computationally intensive over large datasets. Tools such as BLAST require large dedicated com...




Publication date: 2015